Semi-supervised ranking for document retrieval

Authors

  • Kevin Duh
  • Katrin Kirchhoff
Abstract

Ranking functions are an important component of information retrieval systems. Recently there has been a surge of research in the field of “learning to rank”, which aims at using labeled training data and machine learning algorithms to construct reliable ranking functions. Machine learning methods such as neural networks, support vector machines, and least squares have been successfully applied to ranking problems, and some are already being deployed in commercial search engines. Despite these successes, most algorithms to date construct ranking functions in a supervised learning setting, which assumes that relevance labels are provided by human annotators prior to training the ranking function. Such methods may perform poorly when human relevance judgments are not available for a wide range of queries. In this paper, we examine whether additional unlabeled data, which is easy to obtain, can be used to improve supervised algorithms. In particular, we investigate the transductive setting, where the unlabeled data is equivalent to the test data. We propose a simple yet flexible transductive meta-algorithm: the key idea is to adapt the training procedure to each test list after observing the documents that need to be ranked. We investigate two instantiations of this general framework. The Feature Generation approach discovers more salient features from the unlabeled test data and trains a ranker on this test-dependent feature set. The Importance Weighting approach builds on ideas from the domain adaptation literature and re-weights the training data to match the statistics of each test list. We demonstrate that both approaches improve over supervised algorithms on the TREC and OHSUMED tasks from the LETOR dataset.
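To make the Importance Weighting idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each document is a fixed-length feature vector, estimates per-document weights with a logistic-regression domain classifier (a swapped-in density-ratio estimator that the abstract does not specify), and fits a weighted least-squares scorer for one test list. All function names and parameters (importance_weights, fit_weighted_least_squares, rank_test_list, steps, lr, ridge) are hypothetical.

```python
import numpy as np

def importance_weights(X_train, X_test, steps=500, lr=0.1):
    """Estimate w(x) ~ p_test(x) / p_train(x) for each training document.

    Assumption: a logistic-regression classifier that separates training
    documents (label 0) from test-list documents (label 1) stands in for
    the density-ratio estimator; the abstract does not name one.
    """
    X = np.vstack([X_train, X_test])
    X = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
    y = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_test))])
    theta = np.zeros(X.shape[1])
    for _ in range(steps):  # plain batch gradient descent on the log loss
        p = 1.0 / (1.0 + np.exp(-X @ theta))
        theta -= lr * X.T @ (p - y) / len(y)
    p_test = 1.0 / (1.0 + np.exp(-X[: len(X_train)] @ theta))
    p_test = np.clip(p_test, 1e-6, 1.0 - 1e-6)
    # Classifier odds, corrected for the train/test size imbalance.
    return (p_test / (1.0 - p_test)) * (len(X_train) / len(X_test))

def fit_weighted_least_squares(X, y, w, ridge=1e-3):
    """Solve the importance-weighted, ridge-regularized normal equations."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    A = Xb.T @ (w[:, None] * Xb) + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ (w * y))

def rank_test_list(X_train, y_train, X_test):
    """Adapt training to one test list, then rank its documents."""
    w = importance_weights(X_train, X_test)
    beta = fit_weighted_least_squares(X_train, y_train, w)
    scores = np.hstack([X_test, np.ones((X_test.shape[0], 1))]) @ beta
    return np.argsort(-scores)  # indices of test documents, best first
```

In the transductive setting described above, rank_test_list would be called once per test query, so the weights and the fitted scorer are recomputed for every list of documents to be ranked.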


Similar resources

Semi-Supervised Ensemble Ranking

Ranking plays a central role in many Web search and information retrieval applications. Ensemble ranking, sometimes called meta-search, aims to improve the retrieval performance by combining the outputs from multiple ranking algorithms. Many ensemble ranking approaches employ supervised learning techniques to learn appropriate weights for combining multiple rankers. The main shortcoming with th...

Semi-Supervised Information Retrieval System for Clinical Decision Support

This article summarizes the approach developed for TREC 2016 Clinical Decision Support Track. In order to address the daunting challenge of retrieval of biomedical articles for answering clinical questions, an information retrieval methodology was developed that combines pseudo-relevance feedback, semantic query expansion and document similarity measures based on unsupervised word embeddings. T...

Semi-supervised document retrieval

This paper proposes a new machine learning method for constructing ranking models in document retrieval. The method, which is referred to as SSRANK, aims to use the advantages of both the traditional Information Retrieval (I...

Information Retrieval Using Label Propagation Based Ranking

The IR group participated in the cross-language retrieval task (CLIR) at the sixth NTCIR workshop (NTCIR 6). In this paper, we describe our approach to the Chinese Single Language Information Retrieval (SLIR) task and the English-Chinese Bilingual CLIR task (BLIR). We use both bi-grams and single Chinese characters as index units and use OKAPI BM25 as the retrieval model. The initial retrieved documents are...

TUTA1 at the NTCIR-11 Temporalia Task

This paper details our participation in the NTCIR-11 Temporalia task including Temporal Query Intent Classification (TQIC) and Temporal Information Retrieval (TIR). In the TQIC subtask, we explore the rich temporal information in the labeled and unlabeled search queries. Semi-supervised and supervised linear classifiers are learned to predict the temporal classes for each search query. In the T...

Journal:
  • Computer Speech & Language

Volume 25

Publication date: 2011